ABSTRACT

The  estimation  of  the  parameters  of  a  linear  statistical 
model  is  generally  accomplished  by  the  method  of  least 
squares.   However,  when  the  method  of  least  squares  is 
applied  to  nonorthogonal  problems  the  resulting  estimates 
may  be  significantly  different  from  the  true  parameters. 
The  method  of  ridge  regression  may  provide  better  estimates 
in  these  cases;  however,  a  probability  distribution  of  the 
ridge  estimator  is  presently  not  known.   The  form  of  such  a 
distribution  is  dependent  upon  how  the  ridge  parameter,  k, 
is  selected.   Two  possible  objective  methods  of  choosing  k 
are  examined  to  determine  if  either  one  leads  to  a  useful 
probability  distribution. 
I.   BACKGROUND

The  following  conventions  will  be  used  throughout. 
Unless  otherwise  noted,  capital  letters  and  Greek  letters 
will  refer  to  matrices  and  vectors  while  lower  case  letters 
will  refer  to  scalars. 

A.   INTRODUCTION 

The  use  of  linear  statistical  models  is  widespread  in 
scientific  fields  of  all  kinds.   Generally,  the  linear 
statistical  model  is  postulated  as 

$$ Y = X\beta + \epsilon \tag{1} $$

where $Y$ is an n x 1 vector of n observed values of a
dependent variable, $X$ is an n x p matrix containing n
values for each of p predictor (independent) variables,
$\beta$ is a p x 1 vector of p unknown parameters (or
coefficients) to be estimated from data, and $\epsilon$ is an
n x 1 vector representing experimental errors. Usually, the
experimental error is assumed to have a multivariate normal
distribution with mean equal to zero and variance-covariance
matrix equal to $\sigma^2 I$, where $\sigma^2$ is the scalar
value of the common variance of the experimental errors. This
assumption will be made throughout this paper.

In practice, the modeling problem is to estimate the
parameters $\beta$ from data $Y$ and $X$. The most common
method of doing this is called least squares estimation or
sometimes ordinary least squares (OLS). The latter designation
will be used in this paper.

Under certain fairly general and common conditions
OLS is an adequate method of estimating $\beta$. However, when
the data are "ill-conditioned" or nonorthogonal, OLS may
yield poor estimates of the true parameters.

Ridge regression (RR) has been proposed [Ref. 1] as an
alternative estimation method that might yield better
estimates under conditions where OLS does poorly.

B.   ORDINARY  LEAST  SQUARES 

For convenience, it is assumed that the elements of $X$
are scaled such that $X'X$ has the form of a correlation
matrix. This is done by forming from each element $x_{ij}$ a
new element $x'_{ij}$ such that

$$ x'_{ij} = (x_{ij} - \bar{x}_j)/s_{x_j} \tag{2} $$

where $\bar{x}_j$ is the mean value of the elements of the
j-th independent variable and $s_{x_j}$ is its standard
deviation times an appropriate constant such that the diagonal
elements of $X'X$ are equal to one. The OLS estimator of
$\beta$ is then

$$ \hat{\beta} = (X'X)^{-1}X'Y \tag{3} $$

so long as $(X'X)^{-1}$ exists.¹ The estimator $\hat{\beta}$
is unique, unbiased and is the best linear unbiased estimator
(BLUE) of $\beta$ (it has the minimum variance among all linear
unbiased estimators of $\beta$) so long as $E(Y) = X\beta$ and
$E[(Y - X\beta)(Y - X\beta)'] = \sigma^2 I$ where $\sigma^2$ is
a scalar, as assumed previously.
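The scaling in equation (2) and the estimator in equation (3)
can be sketched numerically as follows. This is a minimal
illustration rather than part of the original development: the
synthetic data, the seed, and the helper names
`to_correlation_form` and `ols` are assumptions introduced
here, and the "appropriate constant" is taken to be
sqrt(n - 1), so that each centered column has unit length.

```python
import numpy as np

def to_correlation_form(X):
    """Center each column and scale it so that X'X has unit diagonal."""
    Xc = X - X.mean(axis=0)
    # Dividing by the column norm is the same as dividing by the
    # sample standard deviation times sqrt(n - 1).
    return Xc / np.linalg.norm(Xc, axis=0)

def ols(X, Y):
    """Equation (3), solving (X'X) b = X'Y instead of inverting X'X."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

rng = np.random.default_rng(0)            # synthetic data, illustration only
X_raw = rng.normal(size=(50, 3))
X = to_correlation_form(X_raw)
beta_true = np.array([1.0, -2.0, 0.5])    # hypothetical true parameters
Y = X @ beta_true + 0.1 * rng.normal(size=50)
beta_hat = ols(X, Y)
```

After this scaling, `X.T @ X` is the sample correlation matrix
of the raw predictors, which is the form assumed throughout.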

The OLS estimator $\hat{\beta}$ is commonly used and is
particularly useful when it can be assumed that $Y$ is a
multivariate normal vector with mean vector $X\beta$ and
covariance matrix $\sigma^2 I$. In this case, it can be shown²
that the maximum likelihood estimator of $\beta$ is the same as
the OLS estimator and furthermore, since $\hat{\beta}$ is a
linear function of the elements of $Y$, $\hat{\beta}$ has a
multivariate normal distribution with mean vector equal to
$\beta$ and covariance matrix $\sigma^2 (X'X)^{-1}$. This
latter characteristic of $\hat{\beta}$ allows the use of
hypothesis tests and the computation of confidence bounds.

Unfortunately,  in  some  cases  X'X  is  "ill-conditioned" 

and  OLS  yields  poor  estimates.   This  typically  occurs  when 

an  experiment  is  poorly  designed  or  there  are  economic  or 

physical  restraints  causing  strong  correlations  among  the 

predictor  variables.   In  this  case  X'X,  in  its  correlation 

matrix  form,  will  not  be  orthogonal. 


¹ For a derivation and details of properties of the OLS
estimator, see, for example, Ref. 2.

² For example, see Ref. 2, page 182.


Hoerl and Kennard [Ref. 3] address the eigenvalues of
$X'X$ (denoted by $\lambda_j$, j = 1, 2, ..., p) and point out
that nonorthogonal data are characterized by the smallest
eigenvalue ($\lambda_{min}$) being much less than unity and
that, since $\sigma^2/\lambda_{min}$ is a lower bound for the
mean squared distance between $\hat{\beta}$ and $\beta$, then
for $X'X$ nonorthogonal, the difference between $\hat{\beta}$
and $\beta$ has a high probability of being large. When $X'X$
is nonorthogonal $\hat{\beta}$ is characterized by one or more
of the following difficulties, for example:

(1)  large variance,

(2)  large magnitude of residual errors,

(3)  incorrect signs of parameter estimates.
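The eigenvalue argument can be illustrated with a small
numerical sketch on synthetic, deliberately collinear data
(the data and seed are assumptions for illustration only). It
uses the standard identity that the expected squared distance
between the OLS estimator and the true parameters equals
$\sigma^2$ times the trace of $(X'X)^{-1}$, that is,
$\sigma^2 \sum_j 1/\lambda_j$, which can never fall below
$\sigma^2/\lambda_{min}$.

```python
import numpy as np

rng = np.random.default_rng(1)            # synthetic data, illustration only
n, sigma2 = 40, 1.0
z = rng.normal(size=n)
# Two nearly collinear predictors plus one unrelated predictor.
X_raw = np.column_stack([z, z + 0.05 * rng.normal(size=n), rng.normal(size=n)])
Xc = X_raw - X_raw.mean(axis=0)
X = Xc / np.linalg.norm(Xc, axis=0)       # X'X in correlation form

lam = np.linalg.eigvalsh(X.T @ X)         # eigenvalues in ascending order
lam_min = lam[0]
expected_sq_dist = sigma2 * np.sum(1.0 / lam)   # sigma^2 * trace((X'X)^-1)
lower_bound = sigma2 / lam_min
```

With predictors this strongly correlated, `lam_min` is far
below one and the lower bound alone is already large.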

C.   RIDGE  REGRESSION 

A. E. Hoerl suggested [Refs. 1 and 4] that the large
variance of $\hat{\beta}$ for nonorthogonal data could be
reduced by the addition of a constant k > 0 to the diagonal
elements of $X'X$, thus yielding

$$ \hat{\beta}^* = (X'X + kI)^{-1}X'Y \tag{4} $$

as an estimator. Equation (4) is derived in Appendix A.
Note that for k equal to zero the estimator $\hat{\beta}^*$ is
equal to the OLS estimator $\hat{\beta}$. Therefore, OLS can be
thought of as a special case of ridge regression.³ Hoerl
suggested the name "ridge regression" for this procedure
because of its mathematical similarity to some of his earlier
work [Ref. 5] on quadratic response functions.

³ See Appendix B for a discussion of an even more
general estimator.
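A minimal sketch of equation (4) follows, assuming synthetic
data and a helper named `ridge` introduced here for
illustration. It checks the two remarks made above: k = 0
reproduces the OLS estimate, and a positive k shortens the
estimated coefficient vector.

```python
import numpy as np

def ridge(X, Y, k):
    """Equation (4): solve (X'X + kI) b = X'Y for a fixed k >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ Y)

rng = np.random.default_rng(2)                 # synthetic data, illustration only
X = rng.normal(size=(30, 3))
Y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=30)

b_ols = np.linalg.lstsq(X, Y, rcond=None)[0]   # OLS by least squares
b_k0 = ridge(X, Y, 0.0)                        # k = 0 reproduces OLS
b_k1 = ridge(X, Y, 1.0)                        # k > 0 shrinks the estimate
```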
1.   Mean  Squared  Error 

The rationale behind using the ridge estimator is
to minimize the mean squared error (MSE) associated with
the estimate instead of minimizing the sum of squares of
residuals as is done in OLS.⁴ Hoerl and Kennard show
that the mean squared error is given by

$$ \mathrm{MSE} = \mathrm{Variance} + (\mathrm{Bias})^2 \tag{5} $$

Furthermore, they show that variance is a monotonically
decreasing function of k, that the squared bias is a
monotonically increasing function of k and that the rate
of change of variance, for nonorthogonal data and small k,
is considerably larger than the rate of change of the
squared bias. Figure 1 is a graphical illustration of
these relationships. Hoerl and Kennard argue that it is
possible to find some k > 0 such that the variance is
greatly reduced while only a small amount of bias is
introduced, thus yielding a smaller MSE than if OLS (k = 0)
were used. Indeed they show that if $\beta'\beta$ is bounded,
then such a k always exists. Thus, proper use of ridge
regression on nonorthogonal data ensures a reduced MSE of
estimation.

⁴ In the case of unbiased estimation, which OLS is,
these are equivalent criteria.

FIGURE 1. MEAN SQUARED ERROR FUNCTIONS
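The behavior described for Figure 1 can be reproduced with a
short sketch. It is not taken from the text: it assumes the
standard closed-form expressions for a fixed k, namely total
variance $\sigma^2 \sum_i \lambda_i/(\lambda_i + k)^2$ and
squared bias $k^2 \beta'(X'X + kI)^{-2}\beta$, and it uses a
hypothetical 2 x 2 correlation matrix and parameter vector.

```python
import numpy as np

def ridge_mse_terms(XtX, beta, sigma2, k):
    """Total variance and squared bias of the ridge estimator at fixed k."""
    lam, P = np.linalg.eigh(XtX)
    alpha = P.T @ beta                     # beta in the eigenvector basis
    variance = sigma2 * np.sum(lam / (lam + k) ** 2)
    bias_sq = k ** 2 * np.sum(alpha ** 2 / (lam + k) ** 2)
    return variance, bias_sq

# Hypothetical ill-conditioned correlation matrix and true parameters.
XtX = np.array([[1.0, 0.99], [0.99, 1.0]])
beta = np.array([1.0, 1.0])
sigma2 = 1.0

ks = np.linspace(0.0, 1.0, 101)
terms = np.array([ridge_mse_terms(XtX, beta, sigma2, k) for k in ks])
mse = terms.sum(axis=1)                    # equation (5) at each k
k_best = ks[np.argmin(mse)]                # needs the unknown beta and sigma2
```

Note that `k_best` depends on the unknown $\beta$ and
$\sigma^2$, which is why it cannot be computed directly from
data in practice.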

The problem remains to select an appropriate
value of k. Hoerl and Kennard [Ref. 6] suggest the use of
two graphical devices as aids to determining an appropriate
value of k. The first is the ridge trace, a two-dimensional
plot of the elements of $\hat{\beta}^*$ as functions of k, and
the second is a plot of the squared length of the estimated
coefficient vector, $\hat{\beta}^{*\prime}\hat{\beta}^*$, as a
function of k. The ridge trace is used to gain an
understanding of the underlying correlations between the
various predictor variables while the plot of
$\hat{\beta}^{*\prime}\hat{\beta}^*$ is used to subjectively
determine a suitable range of values of k. A typical ridge
trace is illustrated in Figure 2 and a typical plot of
$\hat{\beta}^{*\prime}\hat{\beta}^*$ is depicted in Figure 3.
Notice that $\hat{\beta}^{*\prime}\hat{\beta}^*$, in Figure 3,
decreases steeply for small k (k < 0.2) but in the range about
0.3 to 0.4 has become much less sensitive to further increases
in k.
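The data behind plots of this kind can be generated as in the
following sketch. The collinear synthetic data, the grid of k
values, and the helper name `ridge_trace` are assumptions made
for illustration; the figures in the text come from other data.

```python
import numpy as np

def ridge_trace(X, Y, ks):
    """Return an array whose row j is the ridge estimate at ks[j]."""
    XtX, XtY = X.T @ X, X.T @ Y
    I = np.eye(X.shape[1])
    return np.array([np.linalg.solve(XtX + k * I, XtY) for k in ks])

rng = np.random.default_rng(3)             # synthetic collinear data
n = 40
z = rng.normal(size=n)
X_raw = np.column_stack([z, z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Xc = X_raw - X_raw.mean(axis=0)
X = Xc / np.linalg.norm(Xc, axis=0)        # X'X in correlation form
Y = X @ np.array([2.0, 1.0, -1.0]) + 0.5 * rng.normal(size=n)
Y = Y - Y.mean()

ks = np.linspace(0.0, 1.0, 21)
trace = ridge_trace(X, Y, ks)              # one column per coefficient curve
sq_len = np.sum(trace ** 2, axis=1)        # squared length at each k
```

Plotting each column of `trace` against `ks` gives a ridge
trace, and plotting `sq_len` against `ks` gives the squared
length curve.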
2.   Alternative  Methods  of  Choosing  k 

The previously described method of subjectively
choosing a suitable value of k is the current method in
use and appears to be useful. A major problem arises,
however, because the method denies to the analyst knowledge
of the probability distribution of $\hat{\beta}^*$ and,
therefore, any probabilistic inferences concerning the
resulting estimator. Hoerl and Kennard have suggested a
general form of ridge regression [Ref. 3] and an iterative
method of determining k. In addition, Hemmerle [Ref. 7] has
derived a closed form solution based on this method. Another
possibility is to use the ridge trace or the plot of
$\hat{\beta}^{*\prime}\hat{\beta}^*$ quantitatively to
calculate a point value for k in such a way that the marginal
probability distribution, $f_{\hat{\beta}^*}$, may be
determined. Two such methods using the ridge trace are
examined in the next section.

FIGURE 2. TYPICAL RIDGE TRACE

FIGURE 3. TYPICAL PLOT OF THE SQUARED LENGTH OF THE RIDGE
ESTIMATOR
II.   PROPOSED  OBJECTIVE  RULES  FOR  CHOOSING  k

The slope (rate of change) of the ridge trace curves
or the absolute change of the ridge trace curves over a
specified interval may be used to determine a value of the
ridge parameter, k, objectively. These criteria are
discussed here.

Either of these criteria may be sensitive to the
behavior of each coefficient $\hat{\beta}_i^*$. In general,
the $\hat{\beta}_i^*$ are not monotonic in k, although they
all approach zero as k is increased without bound. It has been
noted by Marquardt and Snee [Ref. 8] that it is not uncommon
for one or more $\hat{\beta}_i^*$ to increase in absolute
value as k is increased. (See, for example, the ridge trace in
Figure 2.) Therefore, the ridge trace should be examined by
the analyst to detect any behavior of $\hat{\beta}_i^*$ that
might adversely affect the proper selection of k even though
the ridge trace is not to be used directly to select a
specific value of k.

It is clear that $\hat{\beta}^*$ is distributed multivariate
normal if $Y$ is distributed multivariate normal and a
specific value of k is selected a priori. However, whenever
the value of k is dependent on a data sample its value will
not generally be the same for each data sample. Therefore,
k is a random variable. Let K denote this (scalar) random
variable.
The marginal probability distribution of $\hat{\beta}^*$ may
be derived from the joint probability distribution of K and
$\hat{\beta}^*$, which can be determined by

$$ f_{\hat{\beta}^*,K} = f_{\hat{\beta}^*|K} \cdot f_K \tag{6} $$

if the conditional distribution of $\hat{\beta}^*$ given K,
$f_{\hat{\beta}^*|K}$, and the marginal distribution of K,
$f_K$, are known. As stated above, when K is given, the
distribution of $\hat{\beta}^*$ is known. It remains to
determine the marginal of K, $f_K$. Clearly, this distribution
depends on how K is related to $Y$. The procedure will be to
find a mapping from the range of $Y$ into the range of K which
gives the marginal distribution of K. With this distribution
and the known conditional distribution of $\hat{\beta}^*$
given K, the joint distribution of $\hat{\beta}^*$ and K may
be determined. It is convenient to consider the cumulative
distribution function, $F_K(k)$, since, if the functional
relationship of K to $Y$, $K = h(Y)$, is known then

$$ F_K(k) = P[K \le k] = P[h(Y) \le k] = P[Y \in R_k] \tag{7} $$

where $R_k$ is a region in the space of $Y$ corresponding to
$h(Y) \le k$. Thus if $R_k$ can be determined then, since the
marginal distribution of $Y$ is known, $F_K(k) = P[Y \in R_k]$
can be determined and $f_K$ may be determined from $F_K$ by
differentiation. It remains to determine $R_k$ corresponding
to a specified region in the space of K and an objective
rule for mapping from $Y$ to K.

A.   ABSOLUTE  VALUE  CRITERION 

The practical range of the ridge parameter is taken to
be 0 < k < 1 in the literature. It seems reasonable then
to choose the smallest value of k such that all
$\hat{\beta}_i^*(k)$ are close to their respective values at
k = 1. In other words,

$$ |\hat{\beta}_i^*(k) - \hat{\beta}_i^*(1)| \le \delta_i; \quad i = 1, 2, \ldots, p \tag{8} $$

where $\delta_i$ is a constant selected by the analyst. The
criterion expressed by (8) means that the ridge trace curves,
$\hat{\beta}_i^*$, at k are within $\delta_i$ of their values
at k = 1, beyond which there is no interest. Here $\delta_i$
refers to the i-th scalar component of a p x 1 vector,
$\delta$. Suppose that at some $k = k_0$ the m-th component of
the left hand side of (8) is the one whose absolute magnitude
is largest. Define a p x 1 vector $\tau$ such that
$\tau_m = \pm\delta_m$, as appropriate, and the other
components of $\tau$ are equal to the corresponding values of
$\hat{\beta}_i^*(k_0) - \hat{\beta}_i^*(1)$. Then equation (8)
can be rewritten in vector form

$$ \hat{\beta}^*(k_0) - \hat{\beta}^*(1) = \tau \tag{9} $$
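One way to apply criterion (8) numerically is a search over a
grid of k values, sketched below. The grid search, the
tolerances, the synthetic data, and the helper names are all
assumptions made for illustration; the text does not prescribe
a search procedure. Redrawing $Y$ with $X$ held fixed also
shows the point made earlier, that the selected k behaves as a
random variable K.

```python
import numpy as np

def ridge(XtX, XtY, k):
    """Equation (4) written in terms of X'X and X'Y."""
    return np.linalg.solve(XtX + k * np.eye(XtX.shape[0]), XtY)

def select_k_absolute(X, Y, delta, ks):
    """Smallest k in the ascending grid ks that satisfies criterion (8)."""
    XtX, XtY = X.T @ X, X.T @ Y
    b_at_1 = ridge(XtX, XtY, 1.0)
    for k in ks:
        if np.all(np.abs(ridge(XtX, XtY, k) - b_at_1) <= delta):
            return float(k)
    return 1.0                              # (8) holds trivially at k = 1

rng = np.random.default_rng(5)              # synthetic collinear data
n = 30
z = rng.normal(size=n)
X_raw = np.column_stack([z, z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Xc = X_raw - X_raw.mean(axis=0)
X = Xc / np.linalg.norm(Xc, axis=0)
beta = np.array([1.0, 1.0, 1.0])            # hypothetical true parameters
delta = np.full(3, 0.05)                    # hypothetical tolerances
ks = np.linspace(0.0, 1.0, 201)

# Redraw Y with X fixed and record the k chosen for each sample.
k_samples = np.array([
    select_k_absolute(X, X @ beta + 0.5 * rng.normal(size=n), delta, ks)
    for _ in range(100)
])
```

The sorted values of `k_samples` give an empirical
approximation to $F_K$ for this particular design and noise
level only; they do not supply the closed form sought here.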
B.   DERIVATIVE  CRITERION

Another potential criterion to use for selecting k is
to require that the slopes of all $\hat{\beta}_i^*$ be "flat
enough" in the sense that

$$ \left|\frac{\partial \hat{\beta}_i^*(k)}{\partial k}\right| \le \delta_i; \quad i = 1, 2, \ldots, p \tag{10} $$

where $\delta_i$ is as previously defined. Define m such that
the m-th component of the left hand side of (10) is the one
whose absolute magnitude is largest and define a p x 1
vector $\pi$ such that $\pi_m = \pm\delta_m$, as appropriate,
and the other components of $\pi$ are equal to the
corresponding values of $\partial\hat{\beta}_i^*/\partial k$.
Then equation (10) can be written, in vector form

$$ \frac{\partial \hat{\beta}^*(k)}{\partial k} = \pi \tag{11} $$
III.   PROBLEM

The  problem  is  to  determine  the  probability 
distribution  of  K  given  Y.   It  is  proposed  to  determine 
this  by  attempting  to  derive  and  examine  the  functional 
relationship  of  Y  and  K. 

A.   ABSOLUTE  VALUE  CRITERION 

The criterion expressed by equation (9) may be stated,
by substituting from equation (4), as

$$ (X'X + kI)^{-1}X'Y - (X'X + I)^{-1}X'Y = \tau \tag{12} $$

and by factoring

$$ [(X'X + kI)^{-1} - (X'X + I)^{-1}]X'Y = \tau \tag{13} $$

but, as shown in Appendix C, equation (C-4), the expression
in brackets may be expanded to

$$ (X'X + kI)^{-1}[(X'X + I) - (X'X + kI)](X'X + I)^{-1} \tag{14} $$

Therefore, by canceling terms and simplifying, equation (13)
becomes

$$ (1 - k)(X'X + kI)^{-1}(X'X + I)^{-1}X'Y = \tau \tag{15} $$

If $k \ne 1$ and if $(X'X + kI)^{-1}$ and $(X'X + I)^{-1}$
exist, then

$$ X'Y = \left(\frac{1}{1 - k}\right)(X'X + kI)(X'X + I)\tau \tag{16} $$

The task then is to solve the linear equations in (16)
for $Y$ in order to determine $R_k$. Unfortunately, equation
(16) represents p linear restraints (hyperplanes) on n unknown
variables where, in general, n > p. Furthermore, $\tau$ is a
function of $Y$. Thus, $R_k$ is not easily determined under
this criterion.
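The algebra leading from (12) to (16) can be checked
numerically. The sketch below uses arbitrary synthetic data
and an arbitrary k (both assumptions for illustration) and
confirms that the factored form (15) equals the difference in
(12), and that (16) recovers $X'Y$ from $\tau$.

```python
import numpy as np

rng = np.random.default_rng(4)              # synthetic data, illustration only
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
XtX, XtY, I = X.T @ X, X.T @ Y, np.eye(3)
k = 0.3

# Left side of (12): ridge estimate at k minus ridge estimate at 1.
tau = np.linalg.solve(XtX + k * I, XtY) - np.linalg.solve(XtX + I, XtY)

# Left side of (15): (1 - k)(X'X + kI)^{-1}(X'X + I)^{-1} X'Y.
factored = (1 - k) * np.linalg.solve(XtX + k * I, np.linalg.solve(XtX + I, XtY))

# Right side of (16): should reproduce X'Y.
recovered = (XtX + k * I) @ (XtX + I) @ tau / (1 - k)
```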

B.   DERIVATIVE  CRITERION 

The criterion given by equation (11) may be stated, by
substituting from equation (4), as

$$ \frac{\partial}{\partial k}[(X'X + kI)^{-1}X'Y] = \pi \tag{17} $$

or, since $\frac{\partial}{\partial k}(X'X + kI) = I$,

$$ -(X'X + kI)^{-2}X'Y = \pi \tag{18} $$

Now, if $(X'X + kI)$ is not singular then

$$ X'Y = (X'X + kI)^2 \pi \tag{19} $$

where the negative sign has been dropped since the
criterion actually specifies the absolute value of the
components of the derivative and the notation of $\pi$
accounts for proper signs.

Equation (19) is similar to equation (16), as it should
be since the criteria are similar, and the same difficulties
are encountered in determining $R_k$ as for the previous
criterion. In addition, the derivative vector $\pi$ will be
difficult to determine. Therefore, the derivative
criterion does not lead to a useful result either.
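The derivative used in (18) can be checked against a central
finite difference, and criterion (10) can then be applied on a
grid, as in the sketch below. The synthetic data, the step
size, the grid, and the tolerance vector `delta` are
assumptions for illustration; `delta` is deliberately set
loosely enough that the criterion is met at k = 1.

```python
import numpy as np

def ridge(XtX, XtY, k):
    """Equation (4) written in terms of X'X and X'Y."""
    return np.linalg.solve(XtX + k * np.eye(XtX.shape[0]), XtY)

def ridge_slope(XtX, XtY, k):
    """Left side of (18): -(X'X + kI)^{-2} X'Y, using two linear solves."""
    M = XtX + k * np.eye(XtX.shape[0])
    return -np.linalg.solve(M, np.linalg.solve(M, XtY))

rng = np.random.default_rng(6)              # synthetic data, illustration only
X = rng.normal(size=(25, 3))
Y = rng.normal(size=25)
XtX, XtY = X.T @ X, X.T @ Y

k, h = 0.4, 1e-5
analytic = ridge_slope(XtX, XtY, k)
numeric = (ridge(XtX, XtY, k + h) - ridge(XtX, XtY, k - h)) / (2 * h)

# Hypothetical tolerances, loose enough that k = 1 always qualifies.
delta = 1.5 * np.abs(ridge_slope(XtX, XtY, 1.0))
ks = np.linspace(0.0, 1.0, 101)
k_sel = next(float(kk) for kk in ks
             if np.all(np.abs(ridge_slope(XtX, XtY, kk)) <= delta))
```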
IV.   NOTES  ON  THE  FULL  BAYESIAN  RIDGE  ESTIMATOR

The full Bayesian ridge estimator (FBRE) is suggested
by Eskew [Ref. 9] and is given as

$$ \hat{\beta}^* = (X'X + kI)^{-1}(X'Y + k\beta_0) \tag{20} $$

where $\beta_0$ is a prior estimate of $\beta$. There are two
interesting properties of $\hat{\beta}^*$ not noted by Eskew.

First suppose that the prior $\beta_0$ is chosen to be the
OLS estimate $\hat{\beta}$. Then

$$ \hat{\beta}^* = (X'X + kI)^{-1}[X'Y + k(X'X)^{-1}X'Y] \tag{21} $$

and hence

$$ \hat{\beta}^* = (X'X + kI)^{-1}[I + k(X'X)^{-1}]X'Y \tag{22} $$

But

$$ [I + k(X'X)^{-1}] = (X'X + kI)(X'X)^{-1} \tag{23} $$

Substituting (23) into (22)

$$ \hat{\beta}^* = (X'X)^{-1}X'Y = \hat{\beta} \tag{24} $$

Thus if the OLS estimator is used as a prior estimate
for the FBRE, equation (20), then the resulting estimate is
equal to the OLS estimate.

Now, suppose that any prior estimate $\beta_0$ is used in
equation (20) but the resulting estimate is then used as a
prior in (20) to compute another estimate. If this procedure
is repeated indefinitely, in the limit the result will again
be the OLS estimator regardless of what prior, $\beta_0$, was
initially used. The proof of this is shown in Appendix B.
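Both properties can be observed numerically. The sketch below
assumes synthetic, well-conditioned data, an arbitrary k, an
arbitrary starting prior, and a helper named `fbre`; none of
these come from the text.

```python
import numpy as np

def fbre(XtX, XtY, k, prior):
    """Equation (20): (X'X + kI)^{-1} (X'Y + k * prior)."""
    return np.linalg.solve(XtX + k * np.eye(len(prior)), XtY + k * prior)

rng = np.random.default_rng(7)              # synthetic, well-conditioned data
X = rng.normal(size=(30, 3))
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=30)
XtX, XtY = X.T @ X, X.T @ Y
b_ols = np.linalg.solve(XtX, XtY)
k = 0.5

# Property 1: an OLS prior is returned unchanged, as in (24).
b_from_ols_prior = fbre(XtX, XtY, k, b_ols)

# Property 2: feed each result back in as the next prior.
b_iter = np.array([10.0, -10.0, 10.0])      # arbitrary starting prior
for _ in range(100):
    b_iter = fbre(XtX, XtY, k, b_iter)
```

For strongly nonorthogonal data the iteration can be expected
to converge much more slowly, because the contraction factor
$k/(\rho_{min} + k)$ discussed in Appendix B approaches one as
the smallest eigenvalue of $X'X$ approaches zero.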
V.   CONCLUSIONS  AND  RECOMMENDATIONS

A.  CONCLUSIONS

The determination of a probability distribution of the
ridge estimator, $\hat{\beta}^*$, is desirable in order to
facilitate the use of hypothesis tests and the computation of
confidence bounds concerning $\hat{\beta}^*$. The probability
distribution of $\hat{\beta}^*$ depends on the objective rule
used to select the ridge parameter, k. Neither of the two
objective rules examined here appears to lead to a simply
determined probability distribution.

B.  RECOMMENDATIONS

The search for a useful probability distribution of
$\hat{\beta}^*$ should be pursued further. In particular, the
closed form solution for k presented by Hemmerle [Ref. 7] may
prove fruitful. Other possibilities include investigating
other criteria based on the ridge trace such as minimizing
the sum of squares, over all i = 1, 2, ..., p, of the
difference between $\hat{\beta}_i^*(k)$ and
$\hat{\beta}_i^*(1)$. Also, the same criteria applied to the
ridge trace could be considered for the squared length of
$\hat{\beta}^*$.
APPENDIX  A
DERIVATION  OF  THE  RIDGE  REGRESSION  ESTIMATOR

The residual sum of squares for any estimator can be
written

$$ \Phi(\hat{\beta}) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) = e'e \tag{A-1} $$

In ridge regression it is desirable to minimize the
residual sum of squares subject to an acceptable length,
c, of the regression vector $\hat{\beta}^*$. Expressed as a
Lagrangian restraint problem this is

$$ \min \Phi'(\hat{\beta}^*) = (Y - X\hat{\beta}^*)'(Y - X\hat{\beta}^*) + k(\hat{\beta}^{*\prime}\hat{\beta}^* - c) \tag{A-2} $$


where k is the Lagrangian multiplier.

Taking partial derivatives of $\Phi'$ with respect to
$\hat{\beta}^*$ and setting them equal to zero

$$ \frac{\partial}{\partial \hat{\beta}^*}[Y'Y - Y'X\hat{\beta}^* - \hat{\beta}^{*\prime}X'Y + \hat{\beta}^{*\prime}X'X\hat{\beta}^* + k\hat{\beta}^{*\prime}\hat{\beta}^*] = 0 \tag{A-3} $$

Hence

$$ 0 = -(Y'X)' - X'Y + 2X'X\hat{\beta}^* + 2k\hat{\beta}^* \tag{A-4} $$

or

$$ 2X'Y = 2X'X\hat{\beta}^* + 2kI\hat{\beta}^* \tag{A-5} $$

Therefore,

$$ X'Y = (X'X + kI)\hat{\beta}^* \tag{A-6} $$

Now, if $(X'X + kI)$ is non-singular (which k is selected
to ensure), then

$$ \hat{\beta}^* = (X'X + kI)^{-1}X'Y \tag{A-7} $$
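The derivation can be checked numerically: at the ridge
estimate the gradient in (A-4) should vanish, and nearby
points should give a larger value of the penalized sum of
squares in (A-2). The synthetic data, the value of k, and the
helper name `objective` are assumptions for illustration; the
constant term involving c is omitted because it does not
affect the minimizer.

```python
import numpy as np

def objective(b, X, Y, k):
    """Residual sum of squares plus k * b'b (the constant -k*c is omitted)."""
    r = Y - X @ b
    return r @ r + k * (b @ b)

rng = np.random.default_rng(8)              # synthetic data, illustration only
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
k = 0.7

b_ridge = np.linalg.solve(X.T @ X + k * np.eye(3), X.T @ Y)        # (A-7)
gradient = -2 * X.T @ Y + 2 * X.T @ X @ b_ridge + 2 * k * b_ridge  # (A-4)

# Random perturbations of the ridge estimate should not lower the objective.
base = objective(b_ridge, X, Y, k)
perturbed = [objective(b_ridge + 0.1 * rng.normal(size=3), X, Y, k)
             for _ in range(50)]
```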


APPENDIX  B 
FULL  BAYESIAN  RIDGE  ESTIMATION 

A.   BACKGROUND 

Eskew  [Ref.  9]  points  out  that  ridge  estimation  is 
equivalent  to  minimizing  the  squared  differences  between 
the  regression  estimates  and  a  prior  estimate  of  zero 
subject  to  a  constraint  on  the  sum  of  squares  and  suggests 
that  a  non-zero  prior  might  be  more  reasonable.   Following 
this  line  of  reasoning  he  derives  the  full  Bayesian  ridge 
estimator  (FBRE) 


$$ \hat{\beta}^* = (X'X + kI)^{-1}(X'Y + k\beta_0) \tag{B-1} $$

where $\beta_0$ is a prior estimate of the true parameters
$\beta$. Note that the ridge estimator is a special case of
FBRE where the prior is taken to be zero.

Eskew  shows  that  the  variance  of  the  FBRE  is  the  same 
as  the  variance  of  the  ridge  regression  estimator  (RRE) 
while  the  squared  bias  of  the  FBRE  is  less  than  that  for 
the  RRE,  thereby  resulting  in  a  reduction  of  mean  squared 
error. 

B.   ITERATIVE  USE  OF  THE  FULL  BAYESIAN  RIDGE  ESTIMATOR 

Suppose that the FBRE is calculated using any prior,
$\beta_0$, and then the result, $\hat{\beta}^*_1$, is used as
a prior to calculate another FBRE, $\hat{\beta}^*_2$. If this
procedure is repeated m times the result may be written

$$ \hat{\beta}^*_m = (1/k)\sum_{i=1}^{m}(kA)^i X'Y + (kA)^m \beta_0 \tag{B-2} $$

where $A = (X'X + kI)^{-1}$. It is interesting to determine
the form of $\hat{\beta}^*_m$ in the limit as m approaches
infinity. Since $A$ and $X'X$ are positive definite matrices
their eigenvalues are positive. Let $\lambda_i > 0$ be an
eigenvalue of $A$ and $\rho_i > 0$ be an eigenvalue of $X'X$.
Hoerl and Kennard show the relationship between $\lambda_i$
and $\rho_i$ to be

$$ \lambda_i = 1/(\rho_i + k) \tag{B-3} $$


Now there exists an orthogonal p x p matrix $P$ with
$P'P = I$ such that

$$ P'AP = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) \tag{B-4} $$

or, since the eigenvalues of $kA$ are $k\lambda_i$ and the
eigenvalues of $A^m$ are $(\lambda_i)^m$,

$$ P'(kA)^m P = \mathrm{diag}(k^m\lambda_1^m, k^m\lambda_2^m, \ldots, k^m\lambda_p^m) \tag{B-5} $$

Now

$$ P'[\lim_{m \to \infty}(kA)^m]P = \lim_{m \to \infty} P'(kA)^m P \tag{B-6} $$

The right hand side of (B-6) is the limit of the right hand
side of (B-5). By substituting from equation (B-3) a
typical diagonal element is $0 < [k/(\rho_i + k)]^m < 1$,
since $\rho_i > 0$ for all i = 1, 2, ..., p. Therefore, each
of the elements of the right hand side of (B-5) approaches
zero as m approaches infinity. Hence

$$ P'[\lim_{m \to \infty}(kA)^m]P = 0 \tag{B-7} $$

This can only occur if

$$ \lim_{m \to \infty}(kA)^m = 0 \tag{B-8} $$


Therefore, the last term of equation (B-2) is zero in the
limit. Now define a matrix function $S = S(kA)$ where

$$ S = \sum_{i=1}^{\infty}(kA)^i \tag{B-9} $$

DeRusso, Roy, and Close [Ref. 10] show that $S(kA)$ converges
if and only if $S(k\lambda_i)$ converges for all $k\lambda_i$,
the eigenvalues of $kA$. Clearly this will occur if and only
if

$$ |k\lambda_{max}| < 1 \tag{B-10} $$

Substituting equation (B-3)

$$ |k/(\rho_{min} + k)| < 1 \tag{B-11} $$

or, after some algebra

$$ \rho_{min} > -2k \quad \text{and} \quad \rho_{min} > 0 \tag{B-12} $$

Since $\rho_{min}$ is an eigenvalue of a positive definite
matrix, $X'X$, then $\rho_{min} > 0$ and both conditions of
(B-12) are met. Therefore $S(kA)$ does converge. To see what
it converges to, define $S' = S + I$ and multiply $S'$ on the
left by $(I - kA)$

$$ (I - kA)S' = (I - kA)(I + kA + (kA)^2 + \ldots) \tag{B-13} $$

and multiplying the right hand side out

$$ (I - kA)S' = [I + kA + (kA)^2 + \ldots] - [kA + (kA)^2 + \ldots] = I \tag{B-14} $$

Then

$$ S' = (I - kA)^{-1} \tag{B-15} $$
Then

$$ S = \{[(kA)^{-1} - I]kA\}^{-1} - I $$

$$ \quad = (1/k)A^{-1}[(1/k)A^{-1} - I]^{-1} - I \tag{B-16} $$

Substituting $A = (X'X + kI)^{-1}$

$$ S = [(1/k)X'X + I][(1/k)X'X]^{-1} - I = k(X'X)^{-1} \tag{B-17} $$

Substituting $S$ into equation (B-2)

$$ \lim_{m \to \infty}\hat{\beta}^*_m = (1/k)k(X'X)^{-1}X'Y \tag{B-18} $$

Therefore

$$ \lim_{m \to \infty}\hat{\beta}^*_m = (X'X)^{-1}X'Y = \hat{\beta} \tag{B-19} $$

Thus the iterative procedure, starting with any prior
$\beta_0$, converges to the OLS estimator, $\hat{\beta}$.
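The main steps of this argument can be checked numerically.
The sketch below assumes synthetic, well-conditioned data and
arbitrary values of k, m, and the prior. It compares m
iterated FBRE steps with the closed form (B-2), and compares
partial sums of the series (B-9) with the limit
$k(X'X)^{-1}$ from (B-17).

```python
import numpy as np

rng = np.random.default_rng(9)              # synthetic, well-conditioned data
X = rng.normal(size=(30, 3))
Y = rng.normal(size=30)
XtX, XtY = X.T @ X, X.T @ Y
k, m = 2.0, 6
beta0 = np.array([5.0, -3.0, 1.0])          # arbitrary prior
A = np.linalg.inv(XtX + k * np.eye(3))
kA = k * A

# m FBRE steps, each using the previous result as the prior.
b = beta0.copy()
for _ in range(m):
    b = A @ (XtY + k * b)

# Closed form (B-2).
powers = [np.linalg.matrix_power(kA, i) for i in range(1, m + 1)]
b_closed = (1.0 / k) * sum(powers) @ XtY + powers[-1] @ beta0

# Partial sum of (B-9) against the limit in (B-17).
S_partial = sum(np.linalg.matrix_power(kA, i) for i in range(1, 200))
S_limit = k * np.linalg.inv(XtX)
largest_eig = np.max(np.linalg.eigvalsh(kA))   # must be below 1, see (B-10)
```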
APPENDIX  C
MISCELLANEOUS  MATRIX  ALGEBRA  AND  CALCULUS

Let $A$, $B$, and $C$ denote square, nonsingular matrices of
the same order. Denote their inverses by $A^{-1}$, $B^{-1}$,
and $C^{-1}$, respectively.

A.   MATRIX  ALGEBRA

First, note that

$$ C(A + B)^{-1} = (AC^{-1} + BC^{-1})^{-1} \tag{C-1} $$

since

$$ C(A + B)^{-1} = [(A + B)C^{-1}]^{-1} \tag{C-2} $$

$$ \quad = (AC^{-1} + BC^{-1})^{-1} \tag{C-3} $$

Also

$$ A^{-1} \pm B^{-1} = A^{-1}(B \pm A)B^{-1} = B^{-1}(B \pm A)A^{-1} \tag{C-4} $$

since

$$ A^{-1}(B \pm A)B^{-1} = (A^{-1}B \pm I)B^{-1} = (A^{-1} \pm B^{-1}) \tag{C-5} $$

and

$$ B^{-1}(B \pm A)A^{-1} = (I \pm B^{-1}A)A^{-1} = (A^{-1} \pm B^{-1}) \tag{C-6} $$


B.   MATRIX  CALCULUS 

Let $A(t)$, $B(t)$, and $C(t)$ denote m x n matrices whose
elements may be functions of the scalar variable t. Let
$\dot{A}(t)$ and $\dot{B}(t)$ denote the derivatives of $A(t)$
and $B(t)$, respectively, with respect to t.

The following are shown to be true by DeRusso, Roy,
and Close [Ref. 10].

$$ \frac{d}{dt}[A(t)B(t)] = \dot{A}(t)B(t) + A(t)\dot{B}(t) \tag{C-7} $$

and

$$ \frac{d}{dt}A^{-1}(t) = -A^{-1}(t)\dot{A}(t)A^{-1}(t) \tag{C-8} $$
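Identities (C-4) and (C-8) are the two results used in the
body of the paper, and both can be spot-checked numerically.
The sketch below uses random test matrices and a hypothetical
matrix family $A(t) = A + tD$, whose derivative is $D$; these
are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 4
# Random matrices shifted by 5I, which keeps them well away from singular here.
A = rng.normal(size=(n, n)) + 5 * np.eye(n)
B = rng.normal(size=(n, n)) + 5 * np.eye(n)
Ai, Bi = np.linalg.inv(A), np.linalg.inv(B)

# (C-4) with the minus sign, in both factored forms.
c4_lhs = Ai - Bi
c4_mid = Ai @ (B - A) @ Bi
c4_rhs = Bi @ (B - A) @ Ai

# (C-8) for the hypothetical family A(t) = A + t * D.
D = rng.normal(size=(n, n))
t, h = 0.1, 1e-6

def inv_at(s):
    """Inverse of A(s) = A + s * D."""
    return np.linalg.inv(A + s * D)

c8_numeric = (inv_at(t + h) - inv_at(t - h)) / (2 * h)
c8_analytic = -inv_at(t) @ D @ inv_at(t)
```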